A probabilistic approach to printed document understanding
Internal identifier: 000579 (Main/Exploration); previous: 000578; next: 000580
Authors: Eric Medvet [Italy]; Alberto Bartoli [Italy]; Giorgio Davanzo [Italy]
Source:
- International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2011.
French descriptors
- Pascal (Inist)
- Interprétation image, Analyse documentaire, Extraction information, Gestion contenu, Interface utilisateur, Reconnaissance caractère, Reconnaissance optique caractère, Recherche information, Traitement document, Document imprimé, Clic, Brevet, Propriété industrielle, Valorisation, Approche probabiliste, Maximum vraisemblance, Modélisation.
- Wicri :
- topic : Brevet, Propriété industrielle.
English descriptors
- KwdEn :
- Character recognition, Click, Content management, Document analysis, Document processing, Image interpretation, Information extraction, Information retrieval, Maximum likelihood, Modeling, Optical character recognition, Patent rights, Patents, Printed document, Probabilistic approach, Upgrading, User interface.
Abstract
We propose an information extraction approach for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.
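The two steps named in the abstract, maximum likelihood estimation from operator-clicked blocks and selection of the most probable blocks in a new document, can be illustrated with a minimal Python sketch. It is not the authors' model: the abstract does not give the form of the probability, so the sketch assumes a single block per field (rather than a sequence), independent Gaussian distributions over two hypothetical block properties (`x`, `y`, the normalized block position), and an arbitrary variance floor `MIN_VAR`. All names and sample values are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """An OCR-generated block (hypothetical, simplified representation)."""
    text: str
    x: float  # left edge, as a fraction of page width
    y: float  # top edge, as a fraction of page height


# Assumed variance floor: with only a few samples the ML variance can be ~0.
MIN_VAR = 1e-4


def fit_gaussian(values):
    """Maximum likelihood estimates (mean, variance) of a 1-D Gaussian."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, MIN_VAR)


def log_pdf(value, mean, var):
    """Log-density of a 1-D Gaussian at `value`."""
    return -0.5 * (math.log(2 * math.pi * var) + (value - mean) ** 2 / var)


def fit_class(clicked_blocks):
    """Estimate class parameters from the blocks the operator clicked on."""
    return {
        "x": fit_gaussian([b.x for b in clicked_blocks]),
        "y": fit_gaussian([b.y for b in clicked_blocks]),
    }


def score(block, params):
    """Log-probability of a block under the independence assumption."""
    return log_pdf(block.x, *params["x"]) + log_pdf(block.y, *params["y"])


def extract(blocks, params):
    """Return the block that maximizes the estimated probability."""
    return max(blocks, key=lambda b: score(b, params))


# Two training samples of one class: the clicked "total amount" blocks.
samples = [Block("Total: 120.00", 0.70, 0.90), Block("Total: 87.50", 0.72, 0.88)]
params = fit_class(samples)

# Blocks of a new document of the same class.
new_doc = [
    Block("Invoice no. 17", 0.10, 0.05),
    Block("Total: 42.00", 0.71, 0.91),
    Block("Thank you", 0.40, 0.95),
]
best = extract(new_doc, params)
print(best.text)
```

Working in log-probabilities keeps the product of per-property densities numerically stable; a sequence-level version would search over ordered groups of blocks instead of single blocks.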
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000106
- to stream PascalFrancis, to step Curation: 000666
- to stream PascalFrancis, to step Checkpoint: 000133
- to stream Main, to step Merge: 000585
- to stream Main, to step Curation: 000579
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author><name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">12-0083104</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 12-0083104 INIST</idno>
<idno type="RBID">Pascal:12-0083104</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000106</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000666</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000133</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Medvet E:a:probabilistic:approach</idno>
<idno type="wicri:Area/Main/Merge">000585</idno>
<idno type="wicri:Area/Main/Curation">000579</idno>
<idno type="wicri:Area/Main/Exploration">000579</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author><name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Click</term>
<term>Content management</term>
<term>Document analysis</term>
<term>Document processing</term>
<term>Image interpretation</term>
<term>Information extraction</term>
<term>Information retrieval</term>
<term>Maximum likelihood</term>
<term>Modeling</term>
<term>Optical character recognition</term>
<term>Patent rights</term>
<term>Patents</term>
<term>Printed document</term>
<term>Probabilistic approach</term>
<term>Upgrading</term>
<term>User interface</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Interprétation image</term>
<term>Analyse documentaire</term>
<term>Extraction information</term>
<term>Gestion contenu</term>
<term>Interface utilisateur</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Recherche information</term>
<term>Traitement document</term>
<term>Document imprimé</term>
<term>Clic</term>
<term>Brevet</term>
<term>Propriété industrielle</term>
<term>Valorisation</term>
<term>Approche probabiliste</term>
<term>Maximum vraisemblance</term>
<term>Modélisation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Brevet</term>
<term>Propriété industrielle</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We propose an information extraction approach for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., groups of documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We experimentally evaluated our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results, e.g., a success rate often greater than 90% even for classes with just two samples.</div>
</front>
</TEI>
<affiliations><list><country><li>Italie</li>
</country>
</list>
<tree><country name="Italie"><noRegion><name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
</noRegion>
<name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000579 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000579 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:12-0083104 |texte= A probabilistic approach to printed document understanding }}
This area was generated with Dilib version V0.6.32.